JKP factors migration to Python

Introduction

We’re excited to announce the release of the newest version of the code that generates the database used in “Is There a Replication Crisis in Finance?” (Jensen, Kelly, and Pedersen, Journal of Finance 2023), as well as the factor portfolio returns.

Highlights of this release:

  • Migration to Python: We’ve switched from SAS and R to Python.
  • Modular, faster codebase: The code is now more modular and leverages Polars and DuckDB for performance.
  • Single-command execution: The characteristics database and factor returns can be computed in a single Slurm job or a single Python command.
  • Substantial speed-ups: The full database and factor returns can be generated in ~6 hours.
  • Modern output formats: Outputs have moved from CSV to Parquet.

For details on factor definitions and construction, see: Global Factor Data.

Naturally, changing the programming language led to some differences in the underlying algorithms relative to SAS. Nevertheless, most functions and the overall structure of the SAS/R codebase have been retained. The existing SAS/R documentation remains applicable and, in many cases, is more extensive than the current Python documentation.

Output structure

All output is contained in:

SAS-Python-Migrate/build_database/data/processed/

The output is organized as follows:

processed/
  1. accounting_data/
    • quarterly.parquet and annual.parquet contain stock-level accounting characteristics from Compustat data sourced from quarterly and annual filings, respectively.
  2. characteristics/
    • world_data_unfiltered.parquet contains the monthly stock file with stock-level characteristics, without filters.
    • world_data_filtered.parquet contains the monthly stock file with stock-level characteristics filtered by
      primary_sec == 1 (primary security flag), common == 1 (common stock flag), obs_main == 1 (main observation flag), exch_main == 1 (main exchange flag).
    • Partitions of world_data_filtered.parquet by country (e.g., GBR.parquet, USA.parquet). Files are named using ISO Alpha-3 country codes.
  3. other_output/
    • ap_factors_monthly.parquet and ap_factors_daily.parquet contain the returns of FF3 and HXZ4 factor portfolios for every country at monthly and daily frequencies, respectively.
    • market_returns.parquet and market_returns_daily.parquet contain market-portfolio returns for every country at monthly and daily frequencies, respectively.
    • nyse_cutoffs.parquet contains NYSE market-capitalization quantiles used for portfolio construction.
    • return_cutoffs.parquet and return_cutoffs_daily.parquet contain return quantiles used as winsorization thresholds.
  4. portfolios/
    • pfs(_daily).parquet contains portfolios formed at monthly (daily) frequency by sorting stocks into three groups based on non-microcap breakpoints. Portfolio 1 (3) includes stocks with the lowest (highest) value of the characteristic.
    • hml(_daily).parquet contains long–short portfolios formed at monthly (daily) frequency that go long Portfolio 3 (high characteristic values) and short Portfolio 1 (low characteristic values) from pfs(_daily).parquet.
    • lms(_daily).parquet contains long–short portfolios formed at monthly (daily) frequency following the Jensen, Kelly, and Pedersen (2023) signing convention (e.g., long low-asset-growth, short high-asset-growth).
    • cmp(_daily).parquet contains rank-weighted (characteristic-managed) portfolios formed at monthly (daily) frequency within mega, large, small, micro, and nano capitalization groups in the U.S.
    • country_factors(_daily)/ contains country-by-country lms.parquet files for easier use at monthly (daily) frequency. Files are named using ISO Alpha-3 country codes.
    • regional_factors(_daily)/ contains regional factor portfolios based on lms.parquet at monthly (daily) frequency, constructed following Jensen, Kelly, and Pedersen (2023).
    • clusters(_daily).parquet contains cluster outputs.
    • regional_clusters(_daily)/ contains regional cluster outputs.
    • industry_gics.parquet contains GICS industry-level outputs.
  5. return_data/
    • world_dsf.parquet contains daily stock-level returns for all securities in the database.
    • world_ret_monthly.parquet contains monthly stock-level returns for all securities in the database.
    • daily_rets_by_country/ contains partitions of world_dsf.parquet by country (e.g., GBR.parquet, USA.parquet). Files are named using ISO Alpha-3 country codes.

The SAS-Python-Migrate/build_database/data/ directory also contains two additional subfolders, raw/ and interim/, which the code uses to store temporary data while computing stock-level characteristics.

To demonstrate the quality of the new dataset, we compare the factors produced by the SAS/R code with those from the new Python code. The Python functions behave almost identically to their SAS counterparts, with the exception of standardized_accounting_data, which was changed to handle duplicates more precisely.

US factor comparison

First, we examine correlations between U.S. long–short factor-portfolio returns produced by SAS/R and Python. In this setting, we aim for a correlation of 1 for each factor.

Code
import polars as pl
from polars import col
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

# Load data
sas = pl.read_parquet('data/US_factors_SAS.parquet').rename({'date': 'eom'})
py = pl.read_parquet('data/US_factors_py.parquet').sort(['characteristic', 'eom']).rename({'ret_vw_cap': 'ret'})

labels = py['characteristic'].unique().to_list()

df_data = []
for label in labels:
    res = {}
    data = (py.filter(col('characteristic') == label).select(['eom', 'ret'])
              .join(
                  sas.filter(col('name') == label).select(['eom', 'ret']),
                  on = 'eom',
                  how = 'inner',
                  suffix = '_sas'
              )
           )
    res['factor'] = label
    res['corr'] = np.corrcoef(data['ret'], data['ret_sas'])[0][1]
    res['mean_py'] = data['ret'].mean()
    res['mean_sas'] = data['ret_sas'].mean()
    res['std_py'] = data['ret'].std()
    res['std_sas'] = data['ret_sas'].std()
    df_data.append(res)

df = pl.DataFrame(df_data).sort('corr')

# --- Split into two halves ---
half = len(df) // 2
df_top = df[:half]
df_bottom = df[half:]

# --- Create figure with 2 rows (equal height) ---
fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (9,10), height_ratios=[1, 1], constrained_layout=True)
# --- Top Half ---
ax1.bar(df_top["factor"], df_top["corr"], color="royalblue")
ax1.axhline(y=0.90, color="red", linestyle="--", linewidth=2)
ax1.set_ylabel("Correlation", fontsize=12)
ax1.set_xticks(range(len(df_top["factor"])))
ax1.set_xticklabels(df_top["factor"], rotation=60, ha="right", fontsize=6)
ax1.grid(axis="y", linestyle="--", alpha=0.7)

# --- Bottom Half ---
ax2.bar(df_bottom["factor"], df_bottom["corr"], color="royalblue")
ax2.axhline(y=0.90, color="red", linestyle="--", linewidth=2)
ax2.set_ylabel("Correlation", fontsize=12)
ax2.set_xticks(range(len(df_bottom["factor"])))
ax2.set_xticklabels(df_bottom["factor"], rotation=60, ha="right", fontsize=6)
ax2.grid(axis="y", linestyle="--", alpha=0.7)

plt.show()
Figure 1: Correlation of US factor portfolios produced by Python and SAS/R.

As shown above, all factors have correlations very close to 1.

Next, we compare the mean returns and standard deviations of the factors. In the scatter plots below, the x-axis shows the statistic for SAS/R and the y-axis shows the corresponding statistic for Python. If the two versions match, the points should lie on a 45-degree line through the origin (slope 1, intercept 0).

Code
# Coordinates for the 45-degree reference lines
line_coords1 = df.sort('mean_sas')['mean_sas'].to_numpy().flatten()
line_coords2 = df.sort('std_py')['std_py'].to_numpy().flatten()

# Create figure with 1 row and 2 columns
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 5), constrained_layout=True)

# --- Mean return comparison ---
ax1.set_title("Mean return comparison", fontsize=16)
ax1.plot(1.1 * line_coords1, 1.1 * line_coords1, '--', color='red', linewidth=1.5, alpha = 0.7)
ax1.scatter(df['mean_sas'], df['mean_py'], marker='+', color='black', s=80, linewidth=1)
ax1.set_xlabel('SAS/R E[R]', fontsize=14)
ax1.set_ylabel('Python E[R]', fontsize=14)
ax1.grid(True, linestyle='--', alpha=0.7)

# --- Volatility comparison ---
ax2.set_title("Standard deviation comparison", fontsize=16)
ax2.plot(1.1 * line_coords2, 1.1 * line_coords2, '--', color='red', linewidth=1.5, alpha = 0.7)
ax2.scatter(df['std_sas'], df['std_py'], marker='+', color='black', s=80, linewidth=1)
ax2.set_xlabel('SAS/R SD[R]', fontsize=14)
ax2.set_ylabel('Python SD[R]', fontsize=14)
ax2.grid(True, linestyle='--', alpha=0.7)

plt.show()
Figure 2: Comparison of mean and standard deviation of US factor portfolios produced by Python and SAS/R.

Again, the Python and SAS/R results align closely.

World ex-US factor comparison

We now repeat the comparison for the World ex-US region.

Code
import polars as pl
from polars import col
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

# Load data
sas = pl.read_parquet('data/WxUS_factors_SAS.parquet').rename({'date': 'eom'}).filter(col('location')=='world_ex_us')
py = pl.read_parquet('data/WxUS_factors_py.parquet').sort(['characteristic', 'eom']).rename({'ret_vw_cap': 'ret'})

labels = py['characteristic'].unique().to_list()

df_data = []
for label in labels:
    res = {}
    data = (py.filter(col('characteristic') == label).select(['eom', 'ret'])
              .join(
                  sas.filter(col('name') == label).select(['eom', 'ret']),
                  on = 'eom',
                  how = 'inner',
                  suffix = '_sas'
              )
           )
    res['factor'] = label
    res['corr'] = np.corrcoef(data['ret'], data['ret_sas'])[0][1]
    res['mean_py'] = data['ret'].mean()
    res['mean_sas'] = data['ret_sas'].mean()
    res['std_py'] = data['ret'].std()
    res['std_sas'] = data['ret_sas'].std()
    df_data.append(res)

df = pl.DataFrame(df_data).sort('corr')

# --- Split into two halves ---
half = len(df) // 2
df_top = df[:half]
df_bottom = df[half:]

# --- Create figure with 2 rows (equal height) ---
fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (9,10), height_ratios=[1, 1], constrained_layout=True)
# --- Top Half ---
ax1.bar(df_top["factor"], df_top["corr"], color="royalblue")
ax1.axhline(y=0.90, color="red", linestyle="--", linewidth=2)
ax1.set_ylabel("Correlation", fontsize=12)
ax1.set_xticks(range(len(df_top["factor"])))
ax1.set_xticklabels(df_top["factor"], rotation=60, ha="right", fontsize=6)
ax1.grid(axis="y", linestyle="--", alpha=0.7)

# --- Bottom Half ---
ax2.bar(df_bottom["factor"], df_bottom["corr"], color="royalblue")
ax2.axhline(y=0.90, color="red", linestyle="--", linewidth=2)
ax2.set_ylabel("Correlation", fontsize=12)
ax2.set_xticks(range(len(df_bottom["factor"])))
ax2.set_xticklabels(df_bottom["factor"], rotation=60, ha="right", fontsize=6)
ax2.grid(axis="y", linestyle="--", alpha=0.7)

plt.show()
Figure 3: Correlation of World Ex US factor portfolios produced by Python and SAS/R.

Here, the factors differ more than in the U.S. sample. Nevertheless, performance remains strong, with all factors maintaining a correlation above 90%.

Code
# Coordinates for the 45-degree reference lines
line_coords1 = df.sort('mean_sas')['mean_sas'].to_numpy().flatten()
line_coords2 = df.sort('std_py')['std_py'].to_numpy().flatten()

# Create figure with 1 row and 2 columns
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 5), constrained_layout=True)

# --- Mean return comparison ---
ax1.set_title("Mean return comparison", fontsize=16)
ax1.plot(1.1 * line_coords1, 1.1 * line_coords1, '--', color='red', linewidth=1.5, alpha = 0.7)
ax1.scatter(df['mean_sas'], df['mean_py'], marker='+', color='black', s=80, linewidth=1)
ax1.set_xlabel('SAS/R E[R]', fontsize=14)
ax1.set_ylabel('Python E[R]', fontsize=14)
ax1.grid(True, linestyle='--', alpha=0.7)
ax1.xaxis.set_major_formatter(FormatStrFormatter('%.3f'))

# --- Volatility comparison ---
ax2.set_title("Standard deviation comparison", fontsize=16)
ax2.plot(1.1 * line_coords2, 1.1 * line_coords2, '--', color='red', linewidth=1.5, alpha = 0.7)
ax2.scatter(df['std_sas'], df['std_py'], marker='+', color='black', s=80, linewidth=1)
ax2.set_xlabel('SAS/R SD[R]', fontsize=14)
ax2.set_ylabel('Python SD[R]', fontsize=14)
ax2.grid(True, linestyle='--', alpha=0.7)

plt.show()
Figure 4: Comparison of mean and standard deviation of World Ex US factor portfolios produced by Python and SAS/R.

Overall, the mean and standard deviation of the factor returns are close to those produced in SAS/R.

Comparison of characteristics at stock level

At a more granular level, we can also compare the stock-level characteristics produced by the SAS portion of the previous codebase with those produced by Python. For that purpose, we compute the Spearman rank correlation for each characteristic.
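The Spearman rank correlation is simply the Pearson correlation of the ranks of the two series. A minimal sketch without tie handling (in practice, scipy.stats.spearmanr, which handles ties properly, is the safer choice):

```python
import numpy as np

def spearman_corr(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))  # ranks 0..n-1 (no tie handling; a simplification)
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# A monotone but nonlinear relationship has Spearman correlation exactly 1,
# even though its Pearson correlation would be below 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3
print(spearman_corr(x, y))  # 1.0
```

This is why rank correlation is the relevant metric here: it tells us whether the two codebases order stocks the same way, which is what matters for sorted portfolios.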

The plot below shows a histogram of the rank correlation:

Code
corr_df = pl.read_parquet('data/sas_vs_py_summ_stats.parquet')
fig, ax = plt.subplots(figsize=(9, 5))

# Histogram (capture counts and patches)
n, bins, patches = ax.hist(
    corr_df["spearman r-corr"],
    bins=10,
    density=False,
    color="royalblue",
    edgecolor="black",
    linewidth=0.5,
    alpha=0.8,
)

# X limits
ax.set_xlim(0.994, 1.0005)

# Labels & title
ax.set_xlabel(r"$\rho$", fontsize=14)
ax.set_ylabel("n", fontsize=14)
ax.set_title("Distribution of Spearman rank correlations", pad=20, fontsize=16)

# Mean line and label
mean_r = corr_df["spearman r-corr"].mean()
ax.axvline(
    mean_r,
    color="red",
    linestyle="--",
    linewidth=1.3,
)

# Put mean text *above* the tallest bar, horizontal
max_count = n.max() if len(n) > 0 else 1
ax.text(
    mean_r,
    max_count * 1.3,       # 30% above tallest bar
    f"Mean = {mean_r:.3f}",
    color="red",
    ha="center",
    va="bottom",
    fontsize=9,
)

# Numbers on top of each bar
for count, patch in zip(n, patches):
    if count <= 0:
        continue  # skip empty bins (log scale hates 0)
    x = patch.get_x() + patch.get_width() / 2
    y = patch.get_height()
    ax.text(
        x,
        y * 1.05,          # slightly above bar
        f"{int(count)}",
        ha="center",
        va="bottom",
        fontsize=8,
    )

# Light grid on y-axis
ax.grid(axis="y", linestyle="--", alpha=0.4)

# Clean up spines
for spine in ["top", "right"]:
    ax.spines[spine].set_visible(False)

# Log scale on y
ax.set_yscale("log")

fig.tight_layout()
plt.show()
Figure 5: Histogram of Spearman rank correlations.

This shows that portfolios formed by sorting stocks on these characteristics should be roughly identical to those produced in SAS.

The following table shows a more detailed comparison of the distribution of the characteristics produced by SAS/R and Python.

Code
import pandas as pd
from itables import init_notebook_mode, show

# Make DataFrames render as interactive tables
init_notebook_mode(all_interactive=True)

# Read the Parquet file
df = pd.read_parquet("data/sas_vs_py_summ_stats.parquet").sort_values(['spearman r-corr', 'characteristic'], ascending = [False, False])

# Remove the index (reset and drop)
df = df.reset_index(drop=True)

# Show an interactive, scrollable table
show(
    df,
    scrollY="400px",   # vertical scroll area
    scrollX=True,      # horizontal scroll
    #paging=True,       # enable paging
    #maxBytes=0         # <-- disable downsampling so you see all rows/cols
)

For every characteristic, the table reports Pearson and Spearman correlations between the output of the two pipelines (SAS/R and Python), followed by their respective means, standard deviations, skewness, kurtosis, and key quantiles (1st percentile, 25th percentile, median, 75th percentile, and 99th percentile).

The underlying data for the comparison is available here: sas_vs_py_summ_stats.parquet.

Only two characteristics have a Pearson correlation below 95%: ‘resff3_6_1’ and ‘resff3_12_1’. Looking at the rank correlations and other distributional measures, we can see that even where the Pearson correlation is lower, the distributions align closely, and the differences in moments and correlations are driven by outliers.

In the new Python code, the ‘div’ and ‘eqnpo’ characteristics in the market_chars_monthly subroutine are coerced to 0 whenever their magnitude falls below 1e-5, in order to eliminate arithmetic noise.

For a fair comparison of characteristic outputs, we have likewise set ‘div’ and ‘eqnpo’ to 0 in the SAS results whenever their magnitude is smaller than 1e-5. Note, however, that the portfolio return comparisons shown at the beginning were made using the previous SAS output, before applying these zero-floor adjustments to ‘div’ and ‘eqnpo’.

Main sources of differences:

  • Numerical precision: Differences in floating-point handling between SAS and Python can alter inequality evaluations and calculated values.
  • Data revisions: Underlying data may have changed.
  • Algorithmic adjustments: Differences due to updates in standardized_accounting_data for improved duplicate handling.